About the project

How are you feeling right now?

All of this is new and quite promising. I do not know anything about the subject, and learning will probably be challenging; however, I am eager to start. I experienced some difficulties in this first task, as I am unfamiliar with the whole thing, but in the end I managed to overcome those problems.

What do you expect to learn?

I’d like to learn how to process large datasets for use in my research. My specialization is in political science and organizations, so I would mainly like to learn about working with statistics and data visualization.

Where did you hear about this course?

My supervisor warmly recommended that I enrol in this course.

MY GITHUB REPOSITORY


Chapter 2: Regression and Model Validation.

2.1.Data Analysis.

2.1.1.Reading of the dataframe “learning2014” and exploration of its dimension and structure.

First of all, I read the dataframe into R with the function read.table().

Then I use the function dim() to show the dimensions of the dataframe: 166 observations and 7 variables, as explained in the above-mentioned data wrangling section:

## [1] 166   7
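The reading step can be sketched as below. The original data lived at a course URL not shown in this report, so a small stand-in file is written first to keep the example self-contained; the column names match the structure output that follows.

```r
# Write a tiny tab-separated stand-in file (the real data came from a course URL).
tmp <- tempfile(fileext = ".txt")
writeLines(c("gender\tAge\tAttitude\tdeep\tstra\tsurf\tPoints",
             "F\t53\t3.7\t3.58\t3.38\t2.58\t25",
             "M\t55\t3.1\t2.92\t2.75\t3.17\t12"), tmp)

# Read the dataframe and explore its dimensions and structure.
learning2014 <- read.table(tmp, sep = "\t", header = TRUE)
dim(learning2014)   # number of rows and columns
str(learning2014)   # variable names, types and first values
```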

Typing the function str() shows the structure of the dataframe:

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

2.1.2.Graphical overview of the data and summary of their variables.

To visualize the data, I use the function install.packages() to install the visualization packages ggplot2 and GGally. Then I load them into the project with the function library().

install.packages("ggplot2")
install.packages("GGally")
library(ggplot2)
library(GGally)

Visualizing and Exploring the dataframe.

The libraries are opened:

## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Using the fast plotting function pairs(), we can draw a scatterplot matrix consisting of all the possible pairwise scatterplots of the dataframe's columns. Different colors are used for males and females.

This second plot matrix is more advanced, and it is made with ggpairs().
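The two plot matrices can be produced as follows; pairs() is base R, while ggpairs() comes from GGally. A toy dataframe with the same column names stands in for the course data here, so the sketch is runnable on its own.

```r
# Toy stand-in with the same columns as learning2014.
learning2014 <- data.frame(
  gender = factor(c("F", "M", "F", "M")),
  Age = c(53, 55, 49, 53), Attitude = c(3.7, 3.1, 2.5, 3.5),
  deep = c(3.58, 2.92, 3.5, 3.5), stra = c(3.38, 2.75, 3.62, 3.12),
  surf = c(2.58, 3.17, 2.25, 2.25), Points = c(25, 12, 24, 10)
)

# Base-R scatterplot matrix, coloured by gender (column 1 is the factor itself).
pairs(learning2014[-1], col = learning2014$gender)

# The more advanced matrix comes from GGally:
# library(GGally); library(ggplot2)
# ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3))
```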

The summary of the variables:

##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Output interpretation and description: distributions of the variables and the relationships between them.

Females are almost twice as numerous as males, while the males present a wider age range. The summary suggests a significant correlation for surf vs. deep and Points vs. Attitude.

2.1.3.Multiple Regression Model.

I have chosen three explanatory variables, “attitude”, “deep learning” and “surface learning”, together with the target variable “exam points”, to fit a multiple regression model.

Drawing a plot matrix with ggpairs().

Fitting the regression model with three explanatory variables and running the summary:
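The model call matches the Call: line of the summary below; here a synthetic stand-in dataframe with the same column names keeps the sketch runnable on its own.

```r
# Synthetic stand-in for learning2014, with the variables used in the model.
set.seed(1)
learning2014 <- data.frame(Points   = rnorm(50, 22, 5),
                           Attitude = rnorm(50, 3.1, 0.7),
                           deep     = rnorm(50, 3.7, 0.6),
                           surf     = rnorm(50, 2.8, 0.5))

# Fit the multiple regression model and print its summary.
my_model <- lm(Points ~ Attitude + deep + surf, data = learning2014)
summary(my_model)   # coefficients, R-squared, F-statistic
```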

## 
## Call:
## lm(formula = Points ~ Attitude + deep + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## Attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

Commentary and interpretation of the results.

MISSING

2.1.4.Explanation of the relationships between the chosen explanatory variables and the target variable, with the summary of the fitted model (interpreting the model parameters). Explanation and interpretation of the multiple R-squared of the model.

The multiple R-squared of 0.2024 means that the model explains only about 20% of the variance in exam points; the adjusted R-squared of 0.1876 likewise denotes a function that fits the explanatory variables poorly. Only Attitude presents statistical significance.

2.1.5.Diagnostic plots

Residuals vs Fitted values.

Normal QQ-plot.

Residuals vs Leverage.
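The three diagnostic plots above can be drawn from a fitted lm object with plot(); `which = 1` gives Residuals vs Fitted, `which = 2` the Normal Q-Q plot, and `which = 5` Residuals vs Leverage. A small simulated model stands in for the fitted one here.

```r
# Simulated stand-in model for illustration.
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
my_model <- lm(y ~ x, data = d)

# The three standard diagnostic plots used in this section.
par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))
```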

Explanation of the model’s assumptions and interpretation of their validity on the basis of the diagnostic plots.

A Multiple Linear Regression Model has a few assumptions:
a linear relationship between the target variable and the explanatory variables, usually revealed by scatterplots;
multivariate normality, which means that the residuals are normally distributed; the QQ-plot can reveal it;
the absence of multicollinearity, in other words, the explanatory variables are not highly correlated with each other;
homoscedasticity, or constant variance of errors: the error terms have a similar variance across the values of the explanatory variables. A plot of standardized residuals versus predicted values shows whether the points are equally distributed across the range of predicted values.
The diagnostic plots delivered the following observations:
The residuals vs fitted values plot is used to check the assumption of a linear relationship. A horizontal line without distinct patterns is an indicator of a linear relationship; in this case, the red line is more or less horizontal at zero, hence a linear relationship can be assumed.
The normal QQ-plot reveals the presence of multivariate normality. A good indication is the residual points following the straight dashed line. That is the case for the majority of points here, hence normality can also be assumed.
The residuals vs leverage plot identifies the impact of single observations on the model. Influential points lie at the upper or lower right corner, in a position where they are influential against the regression line. In this case the points are on the left side of the plot, so we can say that there are no high-leverage points.


Chapter 3: Logistic Regression.

3.2.Data Analysis.

3.2.2.Reading of the dataframe “alc”, its description and printing of variables’ names.

In this chapter, we will analyze a dataset resulting from the data wrangling of another dataset about high-school students’ performance in mathematics and Portuguese language. Below are the variables of the dataset we are going to analyze:

##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "nursery"     "internet"    "guardian"    "traveltime" 
## [16] "studytime"   "failures"    "schoolsup"   "famsup"      "paid"       
## [21] "activities"  "higher"      "romantic"    "famrel"      "freetime"   
## [26] "goout"       "Dalc"        "Walc"        "health"      "absences"   
## [31] "G1"          "G2"          "G3"          "alc_use"     "high_use"   
## [36] "probability" "prediction"

Altogether, the dataset has a dimension of:

## [1] 382  37

3.2.3.Choosing four interesting variables, and formulating hypotheses in relation to alcohol consumption.

I choose four variables of interest, on which I build the following hypotheses:

VARIABLE CHOSEN   RELATED HYPOTHESIS
"goout"           H1: Students who go out more have higher alcohol consumption.
"freetime"        H2: Students who have more free time are more prone to drink.
"studytime"       H3: The more the study time, the less a student drinks.
"romantic"        H4: Given the courting dynamics, romantics drink less.

3.2.4.Numerical and graphical exploration of variables distribution in relation to alcohol consumption.

Let’s start with a graphical overview of the distributions of the dataset’s variables, in relation to alcohol consumption. By installing and loading "tidyr", "dplyr", and "ggplot2", and then piping the reshaped dataframe to the plot-generating function ggplot() with the pipe operator %>%, we get the following plot:
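One common way to build such a panel of bar plots (a sketch, assuming the gather-then-facet approach; a tiny synthetic stand-in replaces the 37-variable 'alc' data) is:

```r
library(tidyr); library(dplyr); library(ggplot2)

# Small synthetic stand-in for 'alc'; the real dataset has 37 variables.
set.seed(1)
alc <- data.frame(goout     = sample(1:5, 40, replace = TRUE),
                  freetime  = sample(1:5, 40, replace = TRUE),
                  studytime = sample(1:4, 40, replace = TRUE))

# Reshape to long format, then draw one bar-plot panel per variable.
p <- gather(alc, key, value) %>%
  ggplot(aes(value)) + geom_bar() + facet_wrap("key")
p
```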

Cross-tabulations.

Let’s now focus on the cross-tabulations for the specific variables of interest on which the hypotheses were posed:
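Each of the tibbles below can be produced with a dplyr group-and-summarise pipeline of the following shape (a sketch with synthetic stand-in data; in the course data the mean grade is computed from the final grade G3, which is an assumption here):

```r
library(dplyr)

# Synthetic stand-in columns; G3 plays the role of the final grade.
set.seed(1)
alc <- data.frame(goout   = sample(1:5, 100, replace = TRUE),
                  alc_use = sample(seq(1, 5, by = 0.5), 100, replace = TRUE),
                  G3      = sample(5:18, 100, replace = TRUE))

# Count observations and mean grade per combination of goout and alc_use.
tab <- alc %>%
  group_by(goout, alc_use) %>%
  summarise(count = n(), mean_grade = mean(G3), .groups = "drop_last")
tab
```

Swapping `goout` for `freetime`, `studytime` or `romantic` gives the other three tables.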

Go out and Alcohol use.
## # A tibble: 38 x 4
## # Groups:   goout [5]
##    goout alc_use count mean_grade
##    <int>   <dbl> <int>      <dbl>
##  1     1     1      14       11.1
##  2     1     1.5     3        7  
##  3     1     2       2       13.5
##  4     1     2.5     2       12  
##  5     1     3.5     1       10  
##  6     2     1      49       12.6
##  7     2     1.5    21       12.1
##  8     2     2      14       10.3
##  9     2     2.5     8       12.4
## 10     2     3       4       10.8
## # ... with 28 more rows
Free time and Alcohol use.
## # A tibble: 37 x 4
## # Groups:   freetime [5]
##    freetime alc_use count mean_grade
##       <int>   <dbl> <int>      <dbl>
##  1        1     1      10       10.3
##  2        1     1.5     1       16  
##  3        1     2       4       12.2
##  4        1     3       1       10  
##  5        1     4       1        8  
##  6        2     1      18       13.3
##  7        2     1.5    24       12.9
##  8        2     2       7       10.3
##  9        2     2.5     8       11.8
## 10        2     3       6       10.7
## # ... with 27 more rows
Study time and Alcohol use.
## # A tibble: 30 x 4
## # Groups:   studytime [4]
##    studytime alc_use count mean_grade
##        <int>   <dbl> <int>      <dbl>
##  1         1     1      21      12.2 
##  2         1     1.5    20      10   
##  3         1     2      17      10.1 
##  4         1     2.5    10      11.5 
##  5         1     3      12      10.1 
##  6         1     3.5    11       9.09
##  7         1     4       4      13   
##  8         1     5       5       9.6 
##  9         2     1      72      11.7 
## 10         2     1.5    35      12.1 
## # ... with 20 more rows
Romantic and Alcohol use.
## # A tibble: 18 x 4
## # Groups:   romantic [2]
##    romantic alc_use count mean_grade
##    <fct>      <dbl> <int>      <dbl>
##  1 no           1      98      12.3 
##  2 no           1.5    47      11.8 
##  3 no           2      35      11.3 
##  4 no           2.5    32      11.8 
##  5 no           3      25      10.4 
##  6 no           3.5    13      11.1 
##  7 no           4       6      10.3 
##  8 no           4.5     1      10   
##  9 no           5       4      10   
## 10 yes          1      42      11.1 
## 11 yes          1.5    22      11.2 
## 12 yes          2      24      11.2 
## 13 yes          2.5    12      11.7 
## 14 yes          3       7       8.71
## 15 yes          3.5     4       7.5 
## 16 yes          4       3      10   
## 17 yes          4.5     2      11   
## 18 yes          5       5      11.6

Bar plots.

Box plots.

Comments on findings and comparison of the results against the previous hypotheses.

Overall, for the chosen variables, only the females seem to present outliers. As for the whiskers, the females vary more than the males, except for the variable romantic. The only skewed plot is that of the romantic males. Let’s now proceed to the comparison against the hypotheses.

"goout" and H1.

Concerning the variable "goout", I hypothesized (H1) that students who go out more have higher alcohol consumption. Overall, it does seem that people who go out more consume more alcohol. The most striking differences are between classes 1 and 3, but people at 2 have higher consumption than people at 5, and the highest consumption is registered at 3. So it is not self-evident that the more a student goes out, the more they drink; the hypothesis H1 is not entirely correct.

"freetime" and H2.

Regarding the variable "freetime", the related hypothesis (H2) was that students who have more free time are more prone to drink. Again, the levels for answers 3 and 4 are much higher than for 1, 2 and 5. On a general level the hypothesis H2 seems correct, but the distribution shows a stark decrease at 5, which has even lower levels than 2.

"studytime" and H3.

When it comes to the variable "studytime", the findings seem to corroborate the hypothesis (H3), according to which the more the study time, the less a student drinks. The value at 2 is higher than at 1, but consumption decreases as study time increases.

"romantic" and H4.

Finally, to the variable "romantic" I associated the hypothesis (H4) that romantics drink less than non-romantics, due to courting dynamics. The results seem to confirm the hypothesis H4: non-romantic students drink up to twice as much as romantic ones. The male percentage is higher than the female one in this category, with a slightly larger gap among the non-romantics.

3.2.5.Logistic regression.

Using logistic regression, we statistically explore the relationship between the four selected variables and the binary high/low alcohol consumption variable as the target variable.

Summary of the fitted model and its interpretation.

## 
## Call:
## glm(formula = high_use ~ goout + freetime + studytime + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6927  -0.7743  -0.5479   1.0022   2.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.31564    0.62593  -3.700 0.000216 ***
## goout        0.72871    0.12071   6.037 1.57e-09 ***
## freetime     0.08366    0.13415   0.624 0.532863    
## studytime   -0.58629    0.16508  -3.552 0.000383 ***
## romanticyes -0.18301    0.26722  -0.685 0.493442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 399.98  on 377  degrees of freedom
## AIC: 409.98
## 
## Number of Fisher Scoring iterations: 4
Interpretation of the coefficients of the model as odds ratios, with confidence intervals for them.
## (Intercept)       goout    freetime   studytime romanticyes 
## -2.31564309  0.72871166  0.08366382 -0.58628767 -0.18300551

The odds ratio (OR) is obtained by dividing the odds of “success” (Y = 1) for students who have the property X by the odds of “success” of those who do not. As the OR quantifies the relationship between X and Y, an OR higher than 1 indicates that X is positively associated with “success”. The odds ratios can also be obtained by exponentiating the coefficients of a logistic regression model.

Computation of the odds ratios (OR).

Computation of the confidence intervals for the coefficients by the function confint(), and exponentiation of the values by using exp().
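The whole sequence, from the model fit (whose formula matches the Call: line above) to the cbind() table, can be sketched as follows; the data here are simulated, so the numbers will differ from those in the report.

```r
# Simulated stand-in for 'alc', with the four explanatory variables.
set.seed(1)
alc <- data.frame(goout     = sample(1:5, 200, replace = TRUE),
                  freetime  = sample(1:5, 200, replace = TRUE),
                  studytime = sample(1:4, 200, replace = TRUE),
                  romantic  = factor(sample(c("no", "yes"), 200, replace = TRUE)))
alc$high_use <- runif(200) < plogis(-2 + 0.7 * alc$goout - 0.6 * alc$studytime)

# Logistic regression of high alcohol use on the four chosen variables.
m <- glm(high_use ~ goout + freetime + studytime + romantic,
         data = alc, family = "binomial")

OR <- exp(coef(m))      # odds ratios = exponentiated coefficients
CI <- exp(confint(m))   # exponentiated profile-likelihood confidence intervals
cbind(OR, CI)           # combined table, as shown below
```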

## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -3.5691044 -1.1092870
## goout        0.4982923  0.9725916
## freetime    -0.1795761  0.3477794
## studytime   -0.9207772 -0.2716715
## romanticyes -0.7148612  0.3353712

Obtaining the odds ratio with their confidence intervals by using cbind():

##                     OR      2.5 %    97.5 %
## (Intercept) 0.09870269 0.02818108 0.3297940
## goout       2.07240892 1.64590816 2.6447898
## freetime    1.08726331 0.83562438 1.4159199
## studytime   0.55638895 0.39820941 0.7621045
## romanticyes 0.83276356 0.48926002 1.3984594

Result interpretation and comparison with previously formulated hypotheses.

Values bigger than 1 are seen across the whole interval for goout, for freetime (except for its 2.5% bound), and for romantic only at its 97.5% bound; for these there is a positive association. These results mostly confirmed my hypotheses, apart from studytime.

3.2.6.Predictive power of the model.

First we use the function predict() to compute the probability of high use. After adding the new columns to the dataset 'alc', we predict the probabilities and classes, and tabulate the target variable against the predictions:
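The prediction step can be sketched as below (a runnable sketch with simulated data and a fitted model `m` standing in for the report's model; the 0.5 threshold for the class prediction is an assumption consistent with the TRUE/FALSE column shown):

```r
# Simulated stand-in data and logistic model.
set.seed(1)
alc <- data.frame(goout     = sample(1:5, 100, replace = TRUE),
                  studytime = sample(1:4, 100, replace = TRUE))
alc$high_use <- runif(100) < plogis(-2 + 0.7 * alc$goout)
m <- glm(high_use ~ goout + studytime, data = alc, family = "binomial")

alc$probability <- predict(m, type = "response")  # predicted probabilities
alc$prediction  <- alc$probability > 0.5          # TRUE = predicted high use

# Cross-tabulate the target variable against the predictions.
table(high_use = alc$high_use, prediction = alc$prediction)
```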

##     goout freetime studytime romantic probability prediction
## 373     2        3         1       no  0.23263069      FALSE
## 374     3        4         1       no  0.40585185      FALSE
## 375     3        3         3       no  0.16282192      FALSE
## 376     3        4         1      yes  0.36258869      FALSE
## 377     2        4         3       no  0.09258880      FALSE
## 378     4        3         2       no  0.42009575      FALSE
## 379     2        2         2       no  0.13429940      FALSE
## 380     1        1         2       no  0.06441395      FALSE
## 381     5        4         1       no  0.74578989       TRUE
## 382     1        4         1       no  0.13722123      FALSE
Cross tabulations and actual values vs predictions graphic.
##         prediction
## high_use FALSE TRUE
##    FALSE   247   21
##    TRUE     73   41

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64659686 0.05497382 0.70157068
##    TRUE  0.19109948 0.10732984 0.29842932
##    Sum   0.83769634 0.16230366 1.00000000
Training error, and result comments. Model performance vs guessing.

Accuracy measures performance in binary classification as the average share of correctly classified observations. The mean of incorrectly classified observations can be seen as a penalty (loss) function of the classifier: the lower, the better. In this section, we first define a loss function loss_func(), and then apply it to probability = 0, probability = 1, and finally to the prediction probabilities in alc.
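The loss function described above, the mean of incorrectly classified observations, can be written as follows (with a toy check on four observations):

```r
# class: the true 0/1 target; prob: the predicted probability of class 1.
# An observation is misclassified when its probability falls on the wrong
# side of 0.5, i.e. when |class - prob| > 0.5.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy check: two of the four predictions fall on the wrong side of 0.5.
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.6, 0.1))  # -> 0.5
```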

## [1] 0.2984293
## [1] 0.7015707
## [1] 0.2460733

The first and third calls deliver better results than the case of probability = 1; the model works better than guessing.

3.2.7. Bonus: 10-fold model cross-validation.

Cross-validation is a technique for assessing how the results of a statistical analysis generalize to an independent data set. In cross-validation, a sample of data is partitioned into complementary subsets (a larger training set and a smaller testing set); the analysis is performed on the former and the results are validated on the latter. Here we use K = 10 subsets.
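A 10-fold cross-validation of this kind is typically run with cv.glm() from the boot package; the sketch below uses simulated stand-in data and the loss function from the previous section, so the error value will differ from the report's.

```r
library(boot)  # cv.glm() performs K-fold cross-validation for glm models

# Simulated stand-in data and a logistic model.
set.seed(123)
alc <- data.frame(goout = sample(1:5, 200, replace = TRUE))
alc$high_use <- runif(200) < plogis(-2 + 0.7 * alc$goout)
m <- glm(high_use ~ goout, data = alc, family = "binomial")

# Mean of incorrectly classified observations, as defined earlier.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the average testing-fold error.
cv <- cv.glm(data = alc, glmfit = m, cost = loss_func, K = 10)
cv$delta[1]
```

Setting K equal to the number of rows gives leave-one-out cross-validation instead.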

## [1] 0.2460733

With leave-one-out cross validation:

## [1] 0.2460733

with 10-fold cross validation:

## [1] 0.2460733

The ten-fold cross-validation shows a higher prediction error on the testing data than on the training data. It is also lower than the 0.26 error in the DataCamp exercise.

3.2.8. Super Bonus: performance comparison via cross-validation (PLOTS MISSING).

At first I use a logistic regression model with 22 predictors.

## [1] 0.2460733

The function is performed with leave-one-out cross validation.

## [1] 0.2539267

Here the result is given by ten-fold cross validation.

## [1] 0.2513089

With 15 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2696335

Here the result is given by ten-fold cross validation.

## [1] 0.2696335

With 10 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2670157

Here the result is given by ten-fold cross validation.

## [1] 0.2617801

With 5 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2460733

Here the result is given by ten-fold cross validation.

## [1] 0.2460733

Chapter 4: Clustering and Classification.

4.2.Data Analysis.

4.2.2.Loading of the Boston data from the MASS Package. Description, structure and dimension.

The dataset Boston concerns housing values in the suburbs of the homonymous city. I use the functions str() and dim() to explore the dataset. Here is its structure:

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

And here its dimension:

## [1] 506  14
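The dataset ships with the MASS package (part of a standard R installation), so loading and exploring it takes three calls:

```r
library(MASS)
data("Boston")

str(Boston)   # 506 observations of 14 numeric/integer variables
dim(Boston)   # 506 rows, 14 columns
```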

4.2.3.Graphical overview of the data and variables’ summary.

Let’s have a look at the summary() of the variables:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Using the function pairs() we obtain the following graphical overview:

From the plot above it is a bit difficult to see the relations between the variables. Let’s try something else, for instance a correlation plot: the function corrplot() gives us a visual way to look at correlations. First we need to calculate the correlation matrix with cor():
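The matrix below comes from cor() rounded to two digits; a sketch (with the corrplot() call commented out, since corrplot is an add-on package and the exact arguments beyond order = "hclust" are assumptions):

```r
library(MASS)
data("Boston")

# Correlation matrix of all 14 variables, rounded to two digits.
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix

# The visual overview is then drawn with the corrplot package:
# library(corrplot)
# corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust")
```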

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Now that we have the matrix, rounded to two digits, we can create the correlation plot with corrplot(). Here is how it looks:

Outputs’ description and interpretation of the variables’ distributions and relations.

The corrplot() provides us with a graphical overview of the Pearson correlation coefficients calculated with cor(). This measure quantifies the degree to which an association tends to a certain pattern; in this case, it summarizes the strength of a linear association. A value of 0 means that two variables are uncorrelated, while a value of -1 (in red) or 1 (in blue) shows that they are perfectly related.
As we can see, the size and colour intensity of the dots visually show the strength of the linear associations. I used order = "hclust" as the ordering method for this correlation matrix, as it makes the matrix more immediate to read. Among the strongest negative correlations are: dis vs. nox, dis vs. indus, dis vs. age, lstat vs. rm, and lstat vs. medv. On the contrary, among the strongest positive correlations we find: tax vs. rad, tax vs. indus, nox vs. indus, and nox vs. age. Overall, only the variable chas seems to have little or no correlation with the others.

4.2.4.Dataset standardization.

We scale the dataset by using the scale() function, then we can see the scaled variables with summary():

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The function scale() operated on the variables by subtracting the column means from the corresponding columns and dividing the differences by the standard deviations. Here it was possible to scale the whole dataset, as it contains only numerical values.

The class of the boston_scaled object is a:

## [1] "matrix"

so, to complete the procedure, we convert the object into a data frame.
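The scaling and conversion steps look like this (note that in recent R versions the class of a matrix prints as "matrix" "array"):

```r
library(MASS)
data("Boston")

boston_scaled <- scale(Boston)   # centre each column to mean 0, sd 1
class(boston_scaled)             # a matrix, not yet a data frame

boston_scaled <- as.data.frame(boston_scaled)
```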

Creation of the categorical variable of the crime rate (with quantiles as break points).

To create the categorical variable, we use the function cut() together with quantile() to have our factor variable divided by quantiles, in order to get four rates of crime:

## crime
##      low  med_low med_high     high 
##      127      126      126      127
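The categorical crime-rate variable can be built as follows; include.lowest = TRUE makes sure the minimum value falls into the first bin.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quantiles of the scaled crime rate as break points for four categories.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)
```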

Division of the dataset in train and test sets.

At first we use nrow() to count the number of rows in the dataset:

## [1] 506

then with ind <- sample() we randomly choose 80% of them to create the train set, and from the remaining rows we create the test set.
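The split can be sketched as:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

n <- nrow(boston_scaled)            # 506 rows
ind <- sample(n, size = n * 0.8)    # a random 80% of the row indices

train <- boston_scaled[ind, ]       # training set
test  <- boston_scaled[-ind, ]      # the remaining 20% as the test set
```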

4.2.5. Linear Discriminant Analysis (LDA).

In this section we fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. Here we can see the plot:

4.2.6.Class Prediction with LDA on the test data.

We will now run a LDA model on the test data, but before that we will save the crime categories from the test set and then we will remove the categorical crime variable from the test dataset.
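These steps, from standardization through the LDA fit and prediction, can be sketched end to end as follows; the cross-tabulation will differ from the one below because the train/test split is random.

```r
library(MASS)
set.seed(123)

# Standardize and build the categorical crime-rate target.
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL          # replaced by the categorical variable

# 80/20 train/test split.
ind   <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

correct_classes <- test$crime       # save the true crime categories
test$crime <- NULL                  # and remove them from the test set

# Fit LDA on the train set and predict the classes of the test set.
lda.fit  <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
```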

Here is the cross tabulation of the results with the crime categories from the test set:

##           predicted
## correct    low med_low med_high high
##   low       18       9        1    0
##   med_low    7      17        4    0
##   med_high   0      11       14    0
##   high       0       0        0   21

Comments on the Results.

MISSING

4.2.7.Distance Measuring and Clustering of the Boston dataset.

To measure the distances between the observations, we first standardize the dataset using data.Normalization() with type = n1.

Then we compute the distances between observations with the function dist(), which uses the euclidean distance, the most common distance measure, by default; we also try the manhattan method:

After that, we calculate and visualize the total within-cluster sum of squares, using set.seed(123) so that the random initial cluster centres are reproducible, and setting the maximum number of clusters to 10.

Using the elbow method, I choose to go with three centres.

Then we run kmeans(); I divide the plot into four panels to improve clarity:

To be sure, I also try a different number of centres, for instance five.
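The steps above can be sketched as follows; base scale() stands in for data.Normalization(type = "n1") from the clusterSim package, which performs the same standardization.

```r
library(MASS)
set.seed(123)

boston_scaled <- as.data.frame(scale(Boston))  # n1-style standardization

# Distances between observations: euclidean (the default) and manhattan.
dist_eu  <- dist(boston_scaled)
dist_man <- dist(boston_scaled, method = "manhattan")

# Total within-cluster sum of squares for 1..10 centres (elbow method).
twcss <- sapply(1:10, function(k)
  kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")

# Three centres, chosen at the elbow; a subset of columns keeps the
# scatterplot matrix readable.
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[, 1:5], col = km$cluster)
```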

RESULTS INTERPRETATION.

MISSING

BONUS

K-means and LDA on the original Boston data.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Here, as before, is the procedure for k-means with more than two clusters:

Again, we run kmeans():

And here is the LDA model; since the variable chas appeared to be constant within groups, I removed it:
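
A sketch of this bonus step under stated assumptions (k-means with three centres on the standardized data, the cluster assignment as the LDA target, chas dropped; boston_std is an assumed name):

```r
library(MASS)                                   # Boston data and lda()
data("Boston")
boston_std <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_std, centers = 3)
boston_std$cluster <- factor(km$cluster)        # clusters become the target classes
lda_fit <- lda(cluster ~ . - chas, data = boston_std)  # drop chas (constant within groups)
lda_fit
```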

RESULT INTERPRETATION.

The most influential variable as a linear separator between the clusters is tax.

SUPER-BONUS.

In this section we run the code on the scaled training dataset to produce two 3D plots.

In this first 3D plot the color is given by train$crime:

In this second 3D plot the color is defined by km$cluster:
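
The 3D plots were drawn with plotly; a sketch, where matrix_product is a hypothetical name for the training data projected onto the LDA discriminants, and train and km are the objects from the earlier steps (not defined here, so this is not runnable as-is):

```r
# Hypothetical sketch: matrix_product, train and km come from the steps above
library(plotly)

# colour by the crime classes of the training data
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = train$crime)

# colour by the k-means cluster assignment
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = km$cluster)
```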


Chapter 5: Dimensionality Reduction Techniques.

5.1.Data Wrangling.

The data wrangling for this week's assignment continues last week's work on the Human dataset. The work for this week can be found here.

5.2.Data Analysis.

5.2.1.Data graphical overview and variables’ summary.

To visualize the data, we use both ggpairs() and corrplot():

library(GGally); library(corrplot); library(dplyr)
ggpairs(human_)

cor(human_) %>% corrplot(method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, order = "hclust")

Comments on the outputs, and on variables’ relationships and distributions.

The distributions are uneven: not all variables follow a normal distribution, as the plots suggest, which may call for standardization later on. Among the strongest inverse correlations we logically find Mat.Mor and Life.Exp; Life.Exp and Ado.Birth are also inversely correlated. Edu.Exp, GNI and Edu2.FM all correlate negatively with those two variables. As for the direct correlations, the strongest is between Ado.Birth and Mat.Mor, which shows that the adolescent birth rate rises together with maternal mortality. Then, logically, Life.Exp has a strong correlation with Edu.Exp, GNI and Edu2.FM. Edu.Exp also correlates positively with GNI and Edu2.FM, and some positive correlation is found between Edu2.FM and GNI as well. The variables Labo.FM and Parli.F do not show statistically significant correlations with the other variables.

5.2.2.Principal Component Analysis (PCA) on not standardized data.

PCA is a statistical procedure that decomposes a data matrix into a product of smaller matrices and reveals the most important components. In other words, the data are transformed into a new space with the same or a smaller number of dimensions. The first principal component captures the maximum amount of variance of the features in the original data; the second captures the maximum amount of the variability that is left. The principal components are uncorrelated, and each successive component captures less variance than the previous one. In this paragraph we run a PCA on the data shown in the following summary(). We will use the function prcomp(), which employs the more numerically accurate Singular Value Decomposition (SVD).

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
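
The PCA itself was run with prcomp(); a minimal sketch, assuming human_ is the dataframe summarized above (pca_human1 and s1 are the object names the later code uses):

```r
# sketch: assumes the human_ dataframe summarized above is loaded
pca_human1 <- prcomp(human_)   # SVD-based PCA, no standardization here
s1 <- summary(pca_human1)
```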

Here we have the summary() for the principal component analysis:

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Before proceeding to the plot, we extract the rounded percentage of the variance captured by each principal component. In this case we keep only one digit; we also create the axis labels:

pca_pr1 <- round(100*s1$importance[2,], digits = 1)
pca_pr1
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
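
From this vector the axis labels can be built, for example (a sketch; pc_lab1 is the name the biplot call uses):

```r
# sketch: pca_pr1 is the rounded-percentage vector created above
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
```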

Then we visualize the results with a biplot(), which displays the observations and the original features as a scatter plot, and also their relationships, both with each other and with the principal components, in the form of arrows and labels:

biplot(pca_human1, choices = 1:2, cex = c(1, 1.2), col = c("grey50", "deeppink2"), xlab = pc_lab1[1], ylab = pc_lab1[2])

The variability captured by the principal components is 100% for PC1 and 0% for PC2 and all the other PCs. This result is most likely due to the data not being standardized, as discussed below.

5.2.3.Principal Component Analysis (PCA) on standardized data.

It is a good idea to standardize the data because PCA is sensitive to the relative scaling of the original features: it assumes that features with larger variance are more important than features with smaller variance. Standardizing variables with different units makes their values comparable. So, we now repeat the same analysis with standardized variables, whose summary() is the following:
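
A sketch of the repeated steps (human_std, s2 and pca_pr2 are assumed names; pca_human2 and pc_lab2 are the names used by the biplot code):

```r
# sketch: assumes the human_ dataframe from the previous section
human_std <- as.data.frame(scale(human_))               # standardize all variables
pca_human2 <- prcomp(human_std)
s2 <- summary(pca_human2)
pca_pr2 <- round(100 * s2$importance[2, ], digits = 1)  # variance percentages
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")  # axis labels
```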

##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

and here we have the summary() for the PCA:

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

We follow the same procedure as for the non-standardized data for the rounding of the digits and the creation of the labels, and we obtain these results:

##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
biplot(pca_human2, choices = 1:2, cex = c(1, 1.2), col = c("grey50", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])

The results are clearly different this time. With the non-standardized data, PC1 covered 100% of the variance and PC2, like the other six PCs, 0%. Here, using standardized data, PC1 covers 53.6% of the variance and PC2 16.2%; the subsequent PCs cover 9.6%, 7.6%, 5.5%, 3.6%, 2.6% and 1.3% respectively.

5.2.4. Interpretation of the first two principal component dimensions.

By choosing the first few principal components we obtain uncorrelated variables that capture the maximum amount of variation in the data. In the PCA on standardized data, the variability captured is shown on the x and y axes: PC1 captures 53.6% of the variation and PC2 16.2%. PCA is an unsupervised method, as there is no criterion or target variable; it is a powerful way of encapsulating the correlations between the original features in a smaller number of uncorrelated dimensions. As for the original variables, represented by the arrows: female representation in parliament (Parli.F) and Labo.FM are related and are the main contributors to PC2, whilst the maternal mortality ratio (Mat.Mor) and the adolescent birth rate (Ado.Birth) are also related but contribute to PC1. Life.Exp, Edu.Exp, Edu2.FM and GNI are related to one another and also contribute to PC1. These three groups of arrows are almost orthogonal to each other, which indicates that they are uncorrelated.

5.2.5. Multiple correspondence analysis on tea dataset.

Multiple Correspondence Analysis (MCA) can be used when the data consist of categorical variables, as in this tea dataset. Below are the dim(), str() and summary() of the dataset variables.

Dataset exploration and visualization.

In this section we use the tea dataset from the FactoMineR package. I keep only some of the variables: "Tea", "How", "how", "sugar", "where", "lunch". Here are the dimension, structure and summary of the newly composed tea_time dataset:

install.packages("FactoMineR", repos="https://cloud.r-project.org/")
library(FactoMineR); library(dplyr); library(ggplot2); library(tidyr)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
dim(tea_time)
## [1] 300   6
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 

Here we proceed to visualize the tea_time dataset with the following code:

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Multiple Correspondence Analysis.

An MCA biplot shows the patterns of the variables; the distance between them gives a measure of their similarity. To run an MCA we use the following code:

library(FactoMineR)   # MCA() comes from FactoMineR, not MASS
mca <- MCA(tea_time, graph = FALSE)

Here we can see the scree plot:

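
The scree plot can be drawn with the factoextra package; a sketch, assuming the mca object created above:

```r
# sketch: mca is the MCA object created above
library(factoextra)
fviz_screeplot(mca, addlabels = TRUE)   # % of variance per MCA dimension
```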

To visualize the MCA variables together with the individuals I will use a different procedure, because the biplot is sometimes a bit chaotic due to overplotting. First I find the number of categories per variable:

cats = apply(tea_time, 2, function(x) nlevels(as.factor(x)))
cats
##   Tea   How   how sugar where lunch 
##     3     4     3     2     3     2

Then I obtain the table of eigenvalues and the coordinates of the variable categories and of the individuals, which will form the data frames:

mca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## dim 1  0.27937118              15.238428                          15.23843
## dim 2  0.26092645              14.232352                          29.47078
## dim 3  0.21933575              11.963768                          41.43455
## dim 4  0.18943794              10.332978                          51.76753
## dim 5  0.17722310               9.666715                          61.43424
## dim 6  0.15617745               8.518770                          69.95301
## dim 7  0.14375727               7.841306                          77.79432
## dim 8  0.14126310               7.705260                          85.49958
## dim 9  0.11717818               6.391537                          91.89111
## dim 10 0.08660997               4.724180                          96.61529
## dim 11 0.06205294               3.384706                         100.00000
head(mca$var$coord)
##                 Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
## black      0.47266156  0.09385258 -1.08063006  0.56186080  0.55465156
## Earl Grey -0.26425693  0.12347079  0.43288074  0.01902771 -0.21547686
## green      0.48559492 -0.93257436 -0.10846539 -1.37121354  0.01644904
## alone     -0.01774999 -0.26156928 -0.11267800 -0.36538842 -0.39693332
## lemon      0.66914791  0.53068346  1.32935122  1.00939803 -0.37701185
## milk      -0.33671028  0.27164647  0.01305056  0.25378949  1.57343677
head(mca$ind$coord)
##        Dim 1      Dim 2      Dim 3      Dim 4        Dim 5
## 1 -0.2975376 -0.3278777 -0.3272610  0.3649751 -0.001711145
## 2 -0.2371102 -0.1360322 -0.6954324  0.2938098  0.818716502
## 3 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
## 4 -0.5299062 -0.3182139  0.2113554  0.1571100 -0.306607147
## 5 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
## 6 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
mca_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats),
                                                       cats))
mca_obs_df = data.frame(mca$ind$coord)

With these values I produce the following plot:

ggplot(data = mca_vars_df, aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df))) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR")

We can extend the plot to include the individuals, as in an MCA biplot. Because the individuals overlap heavily, density curves are used instead to show the zones of high concentration:

ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.7) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df,
            aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df), colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR") +
  scale_colour_discrete(name = "Variable")

Here we see the summary() of the MCA, which will be commented on in the next section:

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Comments.

In this output we can see the eigenvalues, in other words the variances and the percentage of variance retained by each dimension.

There are also the individuals' coordinates, their contribution to each dimension expressed as a percentage, and the cos2, the squared correlations on the dimensions.

In the categories section we see the coordinates of the variable categories, their contribution in percentage, the cos2, and the v.test value, which follows the normal distribution: if its absolute value is above 1.96, the coordinate is significantly different from zero. The categorical variables section shows the squared correlation (eta2) between each variable and each dimension; a value close to one suggests a strong link between the variable and the dimension.


Chapter 6: Analysis of Longitudinal Data.

In this last chapter, we perform the analyses of chapters 8 and 9 of MABS. For chapter 8 we use the RATS dataset, whilst for chapter 9 we use the BPRS dataset. Both datasets were transformed into long format in the 6.1.1.Data Wrangling exercise available here.

6.2.1. Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach with the RATS dataset.

The RATS dataset comes from a nutrition study by Crowder and Hand (1990) on three groups of rats which were fed differently and had their weight recorded over a nine-week period. The main question is whether the growth profiles of the groups differ. After reading the table with read.table(), we factorize the variables "ID" and "Group" and add the variable "Time"; now we have RATS in the long format RATSL:

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = TRUE)
library(dplyr); library(tidyr)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- gather(RATS, key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD, 3, 4)))

Then we can see both the structure and the summary of long-format dataset RATSL:

## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

We can also see the head() and tail() of the RATSL dataset:

##   ID Group  WD Weight Time
## 1  1     1 WD1    240    1
## 2  2     1 WD1    225    1
## 3  3     1 WD1    245    1
## 4  4     1 WD1    260    1
## 5  5     1 WD1    255    1
## 6  6     1 WD1    260    1
##     ID Group   WD Weight Time
## 171 11     2 WD64    472   64
## 172 12     2 WD64    628   64
## 173 13     3 WD64    525   64
## 174 14     3 WD64    559   64
## 175 15     3 WD64    548   64
## 176 16     3 WD64    569   64

Graphical Displays of Longitudinal Data.

Graphical displays of longitudinal data are helpful in showing as much of the raw data as possible and in unveiling patterns such as aggregate, cross-sectional and longitudinal ones. They also make it simpler to detect unusual individuals or observations. Here we proceed to produce the first plot for all the rats in RATSL, divided into the three groups:

library(tidyr); library(dplyr); library(ggplot2)
p1 <-ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6

Figure 8.1: Individual “Weight” recordings by groups for the RATS data.

From this figure we can see that the weight increased over the period for the large majority of the rats, more steeply in "Group 2" and "Group 3". As for "Group 1", the density of the lines makes the figure difficult to read, but the weight seems to have increased in a less pronounced fashion. Each group also presents an outlier. The figure also shows that rats with higher values at the beginning tend to keep higher values throughout the study; this is the so-called phenomenon of tracking. This was the visualization of the data as they are; now we will proceed to standardize the scores:

RATSL <- RATSL %>%
 group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()
glimpse(RATSL)
## Observations: 176
## Variables: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1...
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 44...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8...

After the standardization we draw the plot that visualizes the standardized "Weight":

p1 <-ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized Weight")
p6

Figure 8.2: Individual “Weight” recordings by groups for RATS data after standardization.

The phenomenon of tracking is more visible in this figure. Still, this type of visualization is not very clear, so let's proceed to some alternative visualization options.

Summary Measure Analysis of Longitudinal Data.

Summary measure methods offer better visualization options. The first consists in visualizing the average weight of each group together with an indication of the variation of the observations at each time point; we will see this option in Figure 8.3. A plotting alternative is a graph of side-by-side boxplots of the observations at each time point, as in Figure 8.4.

Here we have the number of time points, baseline ("WD1") included:

n <- RATSL$Time %>% unique() %>% length()

The following code produces the summary data:

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se=sd(Weight)/sqrt(n)) %>%
  ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
RATSS
## # A tibble: 33 x 4
##    Group  Time  mean    se
##    <fct> <int> <dbl> <dbl>
##  1 1         1  251.  4.59
##  2 1         8  255   3.95
##  3 1        15  254.  3.46
##  4 1        22  262.  4.10
##  5 1        29  265.  3.33
##  6 1        36  265   3.55
##  7 1        43  267.  3.30
##  8 1        44  267.  3.07
##  9 1        50  270.  4.39
## 10 1        57  272.  3.24
## # ... with 23 more rows

which is then used to obtain the plot for mean Weight:

p1 <- ggplot(RATSS, aes (x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(3:5))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(3:5))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = "right")
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7

Figure 8.3: Mean “Weight” by groups for the RATS data.

Apart from the different weight values, we can see that "Group 2" and "Group 3" in particular present similar weight-gain dynamics. "Group 1" has a less steep weight gain and does not show the sudden drop after WD40 seen in the other two groups.

Let’s now proceed to visualize the side-by-side boxplot, which is created by executing the chunk below:

p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = "right")
p5 <- p4 + scale_x_discrete(name = "Time")
p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p6

Figure 8.4: Boxplot for the RATS data.

Here it is possible to note the outlier values for each group: "Group 2" presents the highest weight, the outlier in "Group 1" shows the lowest weight, whilst in "Group 3" we find a series of outliers that fall within the IQR of "Group 2". This visualization option is certainly clearer than those in Figure 8.1 and Figure 8.2.

Applying the Summary Measure Approach.

Here we create summary data by taking, for each rat, the mean weight over the "Time" variable (baseline excluded):

RATSF <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()
glimpse(RATSF)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...

and then we create a boxplot of the measure for each group:

library(ggplot2)
p1 <- ggplot(RATSF, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight), Time > 1")
p5

Figure 8.5: Boxplot of mean summary measures for the three groups for the RATS data.

From the figure we notice that the boxplots for "Group 2" and "Group 3" are rather skewed and present more variable means. All the groups present an outlier, and especially in "Group 2" the value is so high that it can bias the results.

For this reason I remove the outliers in a rather rudimentary way: after checking the values, I filter them out one by one:

RATSFS1 <- RATSF %>%
  filter(mean < 590)%>%
  filter(mean > 239)%>%
  filter(mean != 495.2)

glimpse(RATSFS1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 263.2, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, ...

Then, with the data cleared of the outliers in each group, I proceed to create the new boxplots:

p1 <- ggplot(RATSFS1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight), Time > 1")
p5

Figure 8.6: Boxplot of mean summary measures for the three groups for the RATS data, without the outliers.

Without the outliers the boxplots are much less skewed, though in "Group 2" and "Group 3" the means still vary more than in "Group 1". The distributions of the three groups seem somewhat different; I am interested in comparing "Group 2" and "Group 3", so we shall use a t-test to check this.

We will use the RATSFS1 data without outliers, leaving out "Group 1":

RATSFS1 <- RATSFS1 %>%
  filter(Group != 1)

glimpse(RATSFS1)
## Observations: 6
## Variables: 3
## $ Group <fct> 2, 2, 2, 3, 3, 3
## $ ID    <fct> 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 443.9, 457.5, 455.8, 536.4, 542.2, 536.2

So now we run the t.test():

t.test(mean ~ Group, data = RATSFS1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3 
##        452.4000        538.2667

The report indicates a significant difference between the two group means (p ≈ 5.3e-05).
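As a robustness check, sketched here, Welch's t-test does not assume equal variances, which may be safer with only three observations per group:

```r
# Same comparison without the equal-variance assumption
t.test(mean ~ Group, data = RATSFS1, var.equal = FALSE)
```

If the conclusion holds under both versions of the test, the group difference is more credible.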

I also wanted to add the baseline weight to RATSFS1 for a bit more accuracy. However, this was not possible, as the baseline resulted in NULL: the wide-format RATS data has no "Time" column, and the baseline should instead come from the first weight measurement.

baseline <- RATS$Time   # NULL: the wide-format RATS data has no Time column
RATSFS2 <- RATSFS1 %>%
  mutate(baseline)

This also prevented fitting the ANCOVA model:

#fit <- lm(mean ~ baseline + Group, data = RATSFS2)

#summary(fit)
#anova(fit)
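One way the baseline could have been recovered, sketched here on the assumption that the long-format RATSL still contains the Time == 1 measurements, is to take each rat's first Weight and join it to the summary data by ID, after which the ANCOVA would fit:

```r
library(dplyr)

# Each rat's first weight measurement as the baseline
baselines <- RATSL %>%
  filter(Time == 1) %>%
  select(ID, baseline = Weight)

# Attach the baseline to the summary data and fit the ANCOVA
RATSFS2 <- RATSF %>%
  inner_join(baselines, by = "ID")

fit <- lm(mean ~ baseline + Group, data = RATSFS2)
summary(fit)
anova(fit)
```

This joins on ID rather than relying on row order, so it also works after observations have been filtered out.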

6.2.2. Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables with the BPRS data.

The BPRS dataset comes from Davis (2002). It concerns 40 male subjects who were randomly assigned to one of two treatment groups; each individual was rated on the brief psychiatric rating scale (BPRS) before the treatment began and after it concluded. The scale measures the levels of 18 symptoms from one to seven and is used to assess the presence of schizophrenia. After reading the BPRS dataset, I factorized the variables "treatment" and "subject" and produced the long-format data BPRSL.
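The wrangling step described above can be sketched as follows, assuming the wide data has week columns named week0…week8 as in the original source:

```r
library(dplyr)
library(tidyr)

# Factorize the grouping variables
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject   <- factor(BPRS$subject)

# Long format: one row per subject per week, plus a numeric week variable
BPRSL <- BPRS %>%
  gather(key = weeks, value = bprs, -treatment, -subject) %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))
```

This yields both the character "weeks" column and the integer "week" column that appear in the structure printout below.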

Its structure and summary() are the following:

## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252

And here are also the head() and tail() of the BPRSL:

##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
##     treatment subject weeks bprs week
## 355         2      15 week8   37    8
## 356         2      16 week8   22    8
## 357         2      17 week8   43    8
## 358         2      18 week8   30    8
## 359         2      19 week8   21    8
## 360         2      20 week8   23    8

Linear Mixed Effects Methods for Repeated Measures Data.

Fitting the Independence Model to the BPRS Data.

Here we plot "bprs" against week for the two treatment groups:

p1 <-ggplot(BPRSL, aes(x= week, y = bprs, group = treatment))
p2 <- p1 + geom_text(aes(label = subject))
p3 <- p2 + scale_x_discrete(name = "Week")
p4 <- p3 + scale_y_discrete(name = "bprs")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

Figure 9.1: Plot of bprs against weeks for treatments of the BPRSL data.

The figure is not particularly helpful in assigning each observation to its treatment group because it contains so many observations. Nevertheless, it still retains some pattern, although it is hard to see. Let us then move on to fitting a linear mixed model.

Fitting Linear Mixed Models to the BPRS Data.

Now we will produce a visualization of the "bprs" values for each "subject". First we fit a linear regression model to BPRSL with bprs as the response variable and weeks and treatment as explanatory variables:

BPRS_reg <-lm(bprs ~ weeks + treatment, data = BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ weeks + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.714  -9.226  -2.989   6.786  48.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.7139     2.0590  23.173  < 2e-16 ***
## weeksweek1   -1.6750     2.7624  -0.606  0.54467    
## weeksweek2   -6.3000     2.7624  -2.281  0.02317 *  
## weeksweek3   -8.8500     2.7624  -3.204  0.00148 ** 
## weeksweek4  -11.6500     2.7624  -4.217 3.15e-05 ***
## weeksweek5  -15.4500     2.7624  -5.593 4.51e-08 ***
## weeksweek6  -16.7750     2.7624  -6.073 3.28e-09 ***
## weeksweek7  -15.8000     2.7624  -5.720 2.29e-08 ***
## weeksweek8  -16.5750     2.7624  -6.000 4.92e-09 ***
## treatment2    0.5722     1.3022   0.439  0.66063    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.35 on 350 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.182 
## F-statistic: 9.878 on 9 and 350 DF,  p-value: 1.62e-13

After this linear regression model we plot each subject's "bprs" over time:

p1 <- ggplot(BPRSL, aes(x = week, y =bprs, group = subject))
p2 <- p1 + geom_line(aes(linetype = subject))
p3 <- p2 + scale_x_continuous(name = "Week")
p4 <- p3 + scale_y_continuous(name = "bprs") 
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

Figure 9.2: Plot of individual bprs over time.

The plot looks quite chaotic, but it shows the decrease of the bprs scores over the weeks. As this plot was drawn by subject, we can draw another one by "treatment", which turns out clearer.

library(ggplot2)
p1 <- ggplot(BPRSL, aes(x = week, y =bprs, group = treatment))
p2 <- p1 + geom_line(aes(linetype = treatment))
p3 <- p2 + scale_x_continuous(name = "Week")
p4 <- p3 + scale_y_continuous(name = "bprs") 
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

Figure 9.2bis: Plot of treatment groups’ bprs over time.

"Treatment 1" seems to end the 8 weeks with a higher "bprs", whilst "treatment 2" shows lower values. In both groups the central phase, between weeks 2 and 6, presents the lowest "bprs" values, which then rise again towards the end, although in group 2 not as high as at week 1.
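The impression above can be checked numerically with a sketch like the following, which tabulates the mean bprs per treatment group and week:

```r
library(dplyr)

# Mean bprs for each treatment group at each week
BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean_bprs = mean(bprs)) %>%
  ungroup()
```

Comparing the two groups' means week by week makes the end-of-study difference easier to judge than the line plot alone.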

We can also draw a scatterplot matrix of the repeated measures in the wide-format data:

pairs(BPRS, cex = 0.8)

Figure 9.3: scatterplot of repeated measures in subjects bprs data.

Here are the results from fitting a random intercept model, with week and treatment as explanatory variables, to the BPRS data:

library("lme4")
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

And here are the results from fitting the random intercept and slope model, with week and treatment as explanatory variables, to the BPRS data:

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

We then compare the two models with a likelihood ratio test using anova():

anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test favours the random intercept and slope model (p ≈ 0.026), so allowing the slopes to vary between subjects improves the fit. Here are the results from fitting the random intercept and slope model that allows for a treatment × week interaction to the BPRS data:

BPRS_ref2 <-lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

We can then plot the fitted bprs for each subject:

Fitted <- fitted(BPRS_ref2)
BPRSL <- BPRSL %>% mutate(Fitted)

p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_line(aes(linetype = subject))
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7

p1 <- ggplot(BPRSL, aes(x = week, y = Fitted, group = subject))
p2 <- p1 + geom_line(aes(linetype = subject))
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle ("Fitted")
graph2 <- p7

graph1; graph2

Figure 9.4: Fitted bprs profiles from the interaction model and observed bprs profiles.